Case Study -- NLP Amazon Reviews (code)

Version 0.1

Simon Yang

last update: 25th September 2021

Import libraries and dependent data

Download Amazon review data (optional)

For this script to be fully self-contained, the data can be pulled from the web, here. For the purpose of this exercise, the data was pulled with wget and uncompressed to './data/'.

Read data

Repackage data into pandas dataframe

First, we retrieve all possible unique user keys to define our table's columns.
Let's fill our table with data
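
The two steps above can be sketched as follows. This is a minimal illustration, assuming the parsed reviews are held as a list of dictionaries whose keys may differ between records; the `records` sample and its field names are hypothetical and not taken from the actual data.

```python
import pandas as pd

# Hypothetical parsed reviews; the real field names may differ.
records = [
    {"productId": "B001", "score": 5.0, "text": "Great product"},
    {"productId": "B002", "score": 2.0, "summary": "Meh", "text": "Not great"},
]

# Step 1: collect every unique key, preserving first-seen order.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# Step 2: fill the table; keys missing from a record become NaN.
df = pd.DataFrame(records, columns=columns)
```

Collecting the keys first keeps the column order stable even when some records lack a field.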

Clean-up data:

  1. remove empty review texts
  2. remove duplicates
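
A minimal sketch of these two clean-up steps, assuming the review body lives in a column named `text` (a hypothetical name) and that a duplicate means a fully identical row:

```python
import pandas as pd

# Hypothetical table; "text" is an assumed column name.
df = pd.DataFrame({
    "userId": ["u1", "u2", "u2", "u3", "u4"],
    "text": ["Good", "Bad", "Bad", "", None],
})

# 1. Remove empty review texts (missing values or whitespace only).
has_text = df["text"].fillna("").str.strip() != ""
clean = df[has_text]

# 2. Remove duplicated rows, keeping the first occurrence.
clean = clean.drop_duplicates().reset_index(drop=True)
```

If duplicates should be detected on the review text alone, `drop_duplicates(subset=["text"])` would be the variant to use.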

Retrieve helpfulness and bin into categories:
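
One possible way to do this is sketched below. It assumes the raw helpfulness field is a string of the form "helpful votes/total votes"; that format, the column names, and the bin edges and labels are all assumptions made for illustration, not the values used in this study.

```python
import pandas as pd

# Assumed raw format: "<helpful votes>/<total votes>".
df = pd.DataFrame({"helpfulness": ["0/0", "1/4", "3/5", "9/10"]})

votes = df["helpfulness"].str.split("/", expand=True).astype(int)
df["helpful_votes"] = votes[0]
df["total_votes"] = votes[1]

# Ratio of helpful votes; left undefined (NaN) when nobody voted.
ratio = df["helpful_votes"] / df["total_votes"]
df["helpful_ratio"] = ratio.where(df["total_votes"] > 0)

# Illustrative bin edges and labels only.
df["helpful_bin"] = pd.cut(
    df["helpful_ratio"],
    bins=[0.0, 0.5, 0.75, 1.0],
    labels=["low", "medium", "high"],
    include_lowest=True,
)
```

Reviews without any votes keep a missing bin, which makes it easy to treat them separately later.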

Filter for reviews with fewer than 256 words

Due to computational limitations, we do not consider reviews with more than 256 words.
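
A minimal sketch of this filter, assuming a word is any whitespace-separated token and that the review body is in a column named `text` (both are assumptions):

```python
import pandas as pd

MAX_WORDS = 256  # threshold stated in the text

df = pd.DataFrame({"text": ["short review", "word " * 300, "a b c"]})

# Count whitespace-separated tokens as words (an assumed definition).
df["word_count"] = df["text"].str.split().str.len()

# Keep only reviews with fewer than MAX_WORDS words.
df_short = df[df["word_count"] < MAX_WORDS].reset_index(drop=True)
```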

Word Count Per Review

We now have the following table:

Text processing

We create a function that preprocesses the reviews with the following steps:

Inspect Helpfulness

Using helpfulness to filter the training data

  1. use the data as is
  2. filter for helpful reviews where helpfulness has been assessed by at least 5 people
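
The two options can be sketched as below. The column names are hypothetical, and the rule that a review counts as helpful when more than half of its votes are positive is an assumption; only the minimum of 5 assessments comes from the text.

```python
import pandas as pd

MIN_VOTES = 5  # minimum number of assessments, as stated above

# Hypothetical columns: votes counted as helpful and total votes cast.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "helpful_votes": [0, 4, 1, 9],
    "total_votes": [0, 5, 6, 10],
})

# Option 1: use the data as is.
df_all = df

# Option 2: keep reviews assessed by at least MIN_VOTES people and
# judged helpful by a majority of them (the 0.5 cut-off is an assumption).
assessed = df["total_votes"] >= MIN_VOTES
helpful = df["helpful_votes"] / df["total_votes"].where(assessed) > 0.5
df_helpful = df[assessed & helpful]
```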

Sample data for training and testing (balanced and imbalanced)

Because of resource limitations, we sample our data.

We sample the data without replacement to create a training set and a testing set.

Note that the text processing is performed here with the function defined above
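
A minimal sketch of the sampling step, assuming illustrative set sizes and a fixed random seed. The text-processing call is omitted here; it would be applied to the `text` column of both sets once they are drawn.

```python
import pandas as pd

df = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "score": [i % 5 + 1 for i in range(100)],
})

N_TRAIN, N_TEST = 20, 10  # illustrative sizes

# Draw both sets in one pass without replacement so they cannot overlap.
sample = df.sample(n=N_TRAIN + N_TEST, replace=False, random_state=0)
train = sample.iloc[:N_TRAIN]
test = sample.iloc[N_TRAIN:]
```

Drawing a single sample and splitting it guarantees that no review appears in both the training and the testing set.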

Case 1 -- all the data

Case 2 -- only sample helpful reviews

Case 3 -- all the data (balanced)

Case 4 -- only sample helpful reviews (balanced)
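
For the balanced cases, one common approach is to draw the same number of reviews from every class. The sketch below assumes the class label is a column named `score` (hypothetical) and downsamples every class to the size of the rarest one; the study may have used a different target size.

```python
import pandas as pd

df = pd.DataFrame({
    "text": [f"review {i}" for i in range(60)],
    "score": [5] * 40 + [3] * 15 + [1] * 5,  # imbalanced labels
})

# Sample as many rows per class as the rarest class provides.
n_per_class = int(df["score"].value_counts().min())
balanced = pd.concat([
    group.sample(n=n_per_class, replace=False, random_state=0)
    for _, group in df.groupby("score")
])
```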

Display example training data (Case 1)

Train model

Check whether we already have some results
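
A minimal sketch of such a check, assuming results are cached as a JSON file. The helper names `load_results` and `save_results` and the file layout are hypothetical.

```python
import json
from pathlib import Path


def load_results(path):
    """Return previously saved results, or an empty dict if none exist."""
    path = Path(path)
    if path.exists():
        with path.open() as f:
            return json.load(f)
    return {}


def save_results(results, path):
    """Write the results to disk, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(results, f, indent=2)
```

Loading the cache before training makes it possible to skip configurations that have already been evaluated.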

Function for setting up our model

Function for testing our model

Train multiple instances of the model with various features and parameters and evaluate performance
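
The loop over configurations can be sketched as below. The parameter names in `param_grid` and the `train_and_evaluate` placeholder are hypothetical; the actual features, parameters, and model are not listed in this section.

```python
from itertools import product

# Hypothetical search space, for illustration only.
param_grid = {"max_features": [1000, 5000], "ngram_max": [1, 2]}


def train_and_evaluate(params):
    """Placeholder: train one model instance and return its metrics."""
    # Replace with calls to the real model set-up and testing functions.
    return {"accuracy": None}


results = {}
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    key = ",".join(f"{name}={value}" for name, value in params.items())
    if key in results:  # skip configurations that were already evaluated
        continue
    results[key] = {"params": params, "metrics": train_and_evaluate(params)}
```

Keying the results by a readable parameter string keeps the stored output easy to inspect and compare across runs.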